Search - token java

Search list

[Internet-Network] Writing an HTML File Analyzer in Java

Description:

Writing an HTML File Analyzer in Java

 I. Overview

    

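    At its core, a Web server must correctly analyze the tags in an HTML file, just as an interpreter for a programming language analyzes the reserved words in a source file before interpreting it. In practice we often need to run a keyword analysis on a particular type of file. For example, to download an HTML file together with the related .gif, .class, and other files, we have to separate out the tags in the HTML file and find the required file names and directories. Before Java, this kind of work meant examining every character of the file to locate the relevant parts, which took a lot of code and was error-prone. In a recent project the author used Java's input stream class StreamTokenizer to analyze HTML files, with good results. The goal here is to download an HTML file from a known Web page, analyze it, and then download the HTML files (if the page uses frames), image files, and class (Java applet) files that the page references.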

    

    II. StreamTokenizer

    

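    StreamTokenizer, a tokenizing input stream, turns an input stream into a stream of tokens. The tokens fall into three kinds: words (multi-character tokens), single-character tokens, and whitespace (which includes Java and C/C++ style comments).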

    

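    The constructor of the StreamTokenizer class is: StreamTokenizer(InputStream in). Since JDK 1.1 this form is deprecated in favor of StreamTokenizer(Reader r), which is the form the code in Section III uses.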

    

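    The class exposes several public instance variables: ttype, sval, and nval, which hold the token type, the current string value, and the current numeric value respectively. To get the characters between tokens (that is, between the tag delimiters in HTML), read sval. To advance to the next token, call nextToken(). It returns an int. Ordinary single-character tokens are returned as their character value; apart from those, there are four predefined results: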

    

    StreamTokenizer.TT_NUMBER: the token is a number. Its value is a double and can be read from the instance variable nval.

    

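    StreamTokenizer.TT_WORD: the token is a non-numeric word (other characters may be part of it, depending on the syntax table). The word can be read from the instance variable sval.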

    

    StreamTokenizer.TT_EOL: the token is an end-of-line marker. It is reported only after eolIsSignificant(true) has been called.

    

    StreamTokenizer.TT_EOF: returned by nextToken() once the end of the stream has been reached.

    

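    Before calling nextToken(), set up the syntax table of the input stream so that the tokenizer can tell the different kinds of characters apart. The whitespaceChars(int low, int hi) method defines the range of characters that carry no meaning. The wordChars(int low, int hi) method defines the range of characters that make up words.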
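    The token types described above can be observed with a small experiment. The sketch below is not from the original article: the class name TokenDemo and the sample input are made up for illustration, and it uses the default syntax table rather than a customized one.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: TokenDemo is a hypothetical name, not part of the article.
public class TokenDemo {

    // Splits text into tokens using StreamTokenizer's default syntax table
    // and labels each token with the kind reported by nextToken().
    public static List<String> tokenize(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(true); // report line ends as TT_EOL instead of skipping them
        List<String> tokens = new ArrayList<>();
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (type) {
                    case StreamTokenizer.TT_NUMBER:
                        tokens.add("NUMBER:" + st.nval); // nval is a double
                        break;
                    case StreamTokenizer.TT_WORD:
                        tokens.add("WORD:" + st.sval);
                        break;
                    case StreamTokenizer.TT_EOL:
                        tokens.add("EOL");
                        break;
                    default:
                        // ordinary character: nextToken() returned the character itself
                        tokens.add("CHAR:" + (char) type);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("width = 42\nheight = 7.5"));
    }
}
```

    Note that with the default table the '=' sign is neither a word character nor whitespace, so it comes back as a single-character token. The HtmlTokenizer class in Section III relies on the same behavior for '<' and '>'.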

    

    III. Implementation

    

    1. Implementing the HtmlTokenizer class

    

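    Before analyzing a token stream, first configure its syntax table. In this example, that means letting the program pick out which words are HTML tags. Below is the token stream class for the HTML tags we need. It is a subclass of StreamTokenizer: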

    

    

    import java.io.*;

    class HtmlTokenizer extends StreamTokenizer {
        // Tag codes. Only the tags this example needs are defined;
        // extend the list as required.
        static final int HTML_TEXT = -1;
        static final int HTML_UNKNOWN = -2;
        static final int HTML_EOF = -3;
        static final int HTML_IMAGE = -4;
        static final int HTML_FRAME = -5;
        static final int HTML_BACKGROUND = -6;
        static final int HTML_APPLET = -7;

        boolean outsideTag = true; // true while the reader is outside a <...> tag

        // Constructor: defines the syntax table of this token stream.
        public HtmlTokenizer(BufferedReader r) {
            super(r);
            this.resetSyntax();      // reset the syntax table
            this.wordChars(0, 255);  // every character may be part of a word
            this.ordinaryChar('<');  // delimiters on either side of an HTML tag
            this.ordinaryChar('>');
        } // end of constructor

        public int nextHtml() {
            int token; // current token
            try {
                switch (token = this.nextToken()) {
                case StreamTokenizer.TT_EOF:
                    // end of the stream reached
                    return HTML_EOF;
                case '<': // entering a tag
                    outsideTag = false;
                    return nextHtml();
                case '>': // leaving a tag
                    outsideTag = true;
                    return nextHtml();
                case StreamTokenizer.TT_WORD:
                    // the token is a word: decide which tag it is
                    if (allWhite(sval))
                        return nextHtml(); // skip runs of whitespace
                    else if (sval.toUpperCase().indexOf("FRAME") != -1 && !outsideTag)
                        return HTML_FRAME;
                    else if (sval.toUpperCase().indexOf("IMG") != -1 && !outsideTag)
                        return HTML_IMAGE;
                    else if (sval.toUpperCase().indexOf("BACKGROUND") != -1 && !outsideTag)
                        return HTML_BACKGROUND;
                    else if (sval.toUpperCase().indexOf("APPLET") != -1 && !outsideTag)
                        return HTML_APPLET;
                    // no known tag matched: fall through to default
                default:
                    System.out.println("Unknown tag: " + token);
                    return HTML_UNKNOWN;
                } // end of switch
            } catch (IOException e) {
                System.out.println("Error:" + e.getMessage());
            }
            return HTML_UNKNOWN;
        } // end of nextHtml

        // Returns true if s contains nothing but whitespace.
        // The original omits the body; this one-line version is an assumption.
        protected boolean allWhite(String s) {
            return s.trim().isEmpty();
        } // end of allWhite

    } // end of class

    

    The approach above was tested successfully in a recent project. The operating system was Windows NT 4 and the development tool was Inprise JBuilder 3.
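    The article stops at recognizing the tag. When nextHtml() reports a tag, sval holds the whole text between '<' and '>', for example IMG SRC="logo.gif", and the file name still has to be pulled out of it. One possible way to do that is sketched below. The TagAttribute class and its attributeValue method are hypothetical helpers, not part of the original program, and they assume there is no whitespace around the '=' sign.

```java
// Hypothetical helper, not part of the original article.
public class TagAttribute {

    // Returns the value of the named attribute inside a tag's text
    // (the text between '<' and '>'), or null if the attribute is absent.
    // The attribute name is matched case-insensitively.
    public static String attributeValue(String tagText, String name) {
        String key = name + "=";
        for (int i = 0; i + key.length() <= tagText.length(); i++) {
            // only match at the start or after whitespace, so "SRC=" does not
            // match inside a longer name such as "DATASRC="
            boolean atBoundary = i == 0 || Character.isWhitespace(tagText.charAt(i - 1));
            if (atBoundary && tagText.regionMatches(true, i, key, 0, key.length())) {
                return readValue(tagText, i + key.length());
            }
        }
        return null;
    }

    // Reads a quoted or unquoted attribute value starting at index start.
    private static String readValue(String s, int start) {
        if (start >= s.length()) {
            return "";
        }
        char first = s.charAt(start);
        if (first == '"' || first == '\'') {
            int end = s.indexOf(first, start + 1);
            // tolerate a missing closing quote by taking the rest of the text
            return end < 0 ? s.substring(start + 1) : s.substring(start + 1, end);
        }
        int end = start;
        while (end < s.length() && !Character.isWhitespace(s.charAt(end))) {
            end++;
        }
        return s.substring(start, end);
    }
}
```

    A caller could, for instance, pass sval and "SRC" after nextHtml() returns HTML_IMAGE or HTML_FRAME, "CODE" after HTML_APPLET, and "BACKGROUND" after HTML_BACKGROUND, then resolve the result against the page URL before downloading.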


Platform: | Size: 1066 | Author: tiberxu | Hits:

[Crack Hack] smime_api

Description: A Java interface for encrypting, decrypting, and signing e-mail. It supports the latest pkcs11 electronic token standard and is simple and easy to use.
Platform: | Size: 27035 | Author: 刘永恒 | Hits:

[ELanguage] byylscanner

Description: The lexical analysis part of a compiler principles project, covering source program input and the output and saving of tokens, errors, and the symbol table. The author notes that it is not very polished.
Platform: | Size: 60416 | Author: 闫相通 | Hits:

[Crack Hack] smime_api

Description: A Java interface for encrypting, decrypting, and signing e-mail. It supports the latest pkcs11 electronic token standard and is simple and easy to use.
Platform: | Size: 26624 | Author: 刘永恒 | Hits:

[JSP/Java] strutsToken

Description: Solves the duplicate form submission problem with Struts. The program shows two approaches: (1) without the Struts framework, using a random value; (2) with the Struts framework, using its token mechanism.
Platform: | Size: 1351680 | Author: wyl | Hits:

[Communication] tokenring

Description: A Token Ring simulation in omnet++. All simulation files are included; just run the exe file.
Platform: | Size: 498688 | Author: zhongchun | Hits:

[JSP/Java] Struts1.xToken

Description: A packaged download of a Java Struts 1.x token application.
Platform: | Size: 371712 | Author: oneself | Hits:

[TCP/IP stack] Controleur_Erreur

Description: Token ring with Java
Platform: | Size: 67584 | Author: ahmed | Hits:

[JSP/Java] HowEasy

Description: Another TopCoder problem, HowEasy, implemented in Java. Problem Statement. ***Note: Please keep programs under 7000 characters in length. Thank you. Class Name: HowEasy. Method Name: pointVal. Parameters: String. Returns: int. TopCoder has decided to automate the process of assigning problem difficulty levels to problems. TopCoder developers have concluded that problem difficulty is related only to the Average Word Length of Words in the problem statement: If the Average Word Length is less than or equal to 3, the problem is a 250 point problem. If the Average Word Length is equal to 4 or 5, the problem is a 500 point problem. If the Average Word Length is greater than or equal to 6, the problem is a 1000 point problem. Definitions: Token: a set of characters bound on either side by spaces, the beginning of the input String parameter or the end of the input String parameter. Word: a Token that contains only letters (a-z or A-Z) and may end with a single period. A Word must have at lea
Platform: | Size: 1024 | Author: 姜水烈山 | Hits:

[Other] Lexer

Description: In this assignment, you will build a simple lexer for arithmetic expressions. A lexer breaks an input string into tokens, each of which is a unit of the expression. We consider the following as tokens: (1) the symbols +, -, ( and ); (2) variable names consisting of alphanumeric characters and underscores; and (3) integer numeric constants. You do not have to distinguish between the three kinds of tokens for now.
Platform: | Size: 3072 | Author: faasewq | Hits:
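
The assignment described in this entry can be sketched in a few lines. This is not the code in the listed download: the class name SimpleLexer and the choice to reject any other character with an exception are assumptions made for illustration. Because the three kinds of tokens need not be distinguished, a run of letters, digits, and underscores is simply returned as one token.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; SimpleLexer is a hypothetical name.
public class SimpleLexer {

    // Breaks an arithmetic expression into tokens: the symbols + - ( ),
    // and maximal runs of letters, digits, and underscores
    // (variable names and integer constants, not told apart here).
    public static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace only separates tokens
            } else if (c == '+' || c == '-' || c == '(' || c == ')') {
                tokens.add(String.valueOf(c));
                i++;
            } else if (isNameChar(c)) {
                int start = i;
                while (i < input.length() && isNameChar(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else {
                // assumption: anything else is reported as an error
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at index " + i);
            }
        }
        return tokens;
    }

    private static boolean isNameChar(char c) {
        return c == '_'
                || (c >= '0' && c <= '9')
                || (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z');
    }
}
```

For example, lex("(x1 + 42) - y_2") yields the seven tokens (, x1, +, 42, ), -, y_2.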

[JSP/Java] LCXCompiler

Description: A lexical analyzer implemented in Java. You can type in or directly open a Java source file, and it returns the token table and the symbol table.
Platform: | Size: 163840 | Author: chenxin | Hits:

[JSP/Java] token_ring

Description: Simulation of Token Ring in java
Platform: | Size: 53248 | Author: nagaria | Hits:

[JSP/Java] vtd-xml-2.6-java-src

Description: VTD-XML is a new open-source, Java-based XML processing API that addresses many problems of current XML processing models. The project is hosted on SourceForge. Starting from the very first step, tokenization, it introduces a large number of optimization techniques. For XML files that don't declare entities in the Document Type Declaration (e.g. SOAP), tokenization can be done by recording only the starting offset and length. To make this work, the XML must also be kept in memory intact and un-decoded. This has led to the design of a binary encoding specification called Virtual Token Descriptor (VTD). VTD records are 64-bit integers that encode the starting offsets, lengths, token types and nesting depths of tokens in the XML document.
Platform: | Size: 1037312 | Author: taotaoler | Hits:

[ELanguage] Test

Description: A lexical analyzer for the Java language. It outputs the token table and the symbol table and reports errors.
Platform: | Size: 221184 | Author: liunim | Hits:

[JSP/Java] token.java

Description: token bucket code in java
Platform: | Size: 9216 | Author: raju | Hits:

[JSP/Java] TencentWeibo

Description: Logs in automatically through the Tencent API from Java and obtains the latest token.
Platform: | Size: 2048 | Author: lf | Hits:

[JSP/Java] Token.java

Description: java program snippet2
Platform: | Size: 2048 | Author: rox | Hits:

[JSP/Java] Token-ring-simulation-master

Description: A Token Ring simulation written in Java, provided for reference.
Platform: | Size: 24576 | Author: 徐徐 | Hits:

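[JSP/Java] weixin-access-token

Description: Access_Token is an important credential for calling WeChat. It should not be fetched from the WeChat service too frequently, and it is valid for 7200 seconds. Most of the source code available online is PHP. This is a Java utility class that obtains the Access_Token and stores it temporarily in a local file.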
Platform: | Size: 3072 | Author: 纳特 | Hits:

[JSP/Java] JAVA-springMVC-weixin

Description: As with many third-party server integrations, connecting a server to WeChat requires a verification step, which the author describes as encryption and decryption. After the configuration is filled in, the WeChat server sends a GET request to the server URL configured on the public platform, carrying four parameters: signature, timestamp, nonce, and echostr. Our own server combines the token configured on the public platform with the received timestamp and nonce, applies SHA1, and compares the result with signature. Returning true indicates that the connection succeeded.
Platform: | Size: 114688 | Author: zrd001 | Hits:
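
The check described in this entry can be sketched as follows. This is not the code in the listed download: the class WeChatSignature and its method names are hypothetical. The entry only says the three values are combined; sorting them lexicographically before concatenation is an assumption based on WeChat's published verification procedure, as is comparing against a lowercase hexadecimal digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustrative sketch only; WeChatSignature is a hypothetical name.
public class WeChatSignature {

    // Computes the expected signature: sort token, timestamp and nonce
    // (assumed step), concatenate them, and take the SHA-1 digest as hex.
    public static String sign(String token, String timestamp, String nonce) {
        String[] parts = {token, timestamp, nonce};
        Arrays.sort(parts);
        String joined = String.join("", parts);
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // every standard Java platform is required to provide SHA-1
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Returns true if the signature sent with the request matches the one
    // computed from the locally configured token.
    public static boolean verify(String token, String signature,
                                 String timestamp, String nonce) {
        return sign(token, timestamp, nonce).equalsIgnoreCase(signature);
    }
}
```

In a servlet or Spring MVC handler, the GET endpoint would call verify with the request parameters and, on success, respond as the platform expects.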

CodeBus www.codebus.net